Important: The dataset we used to create the network comes from Twitter; you can view and download it, along with the Explainer notebook, from our repo.

Introduction

In this project we focus on security people. As the world of IT becomes more widespread and increasingly complex, many security vulnerabilities arise that could have an enormous impact on a company or, in the worst case, a whole country. A few guardians dedicate their lives to securing IT infrastructure so the rest of us can sleep peacefully.

We use data from Twitter to build a network of security people based on Twitter's friend concept. With the network in hand, we run community detection to find out whether these people fall into groups, and we build a word cloud for each community to understand what its members talk about. Finally, we estimate the sentiment of each community by analyzing the text written by the people in it.

Dataset

Twitter is the main source of our dataset. We used Twitter API combined with Tweepy library to download our dataset.
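As a rough sketch of the download step (the credentials, query, and file name below are placeholders, not the ones used for the real dataset):

```python
import csv

def tweet_to_row(status):
    """Flatten a Tweepy Status object into the fields we keep."""
    return {
        "screen_name": status.user.screen_name,
        "text": status.text,
        "created_at": str(status.created_at),
    }

def download_tweets(query, limit=200):
    """Fetch recent tweets matching the query (needs real API credentials)."""
    import tweepy  # imported lazily so the pure helpers above work without it
    auth = tweepy.OAuth1UserHandler("API_KEY", "API_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
    api = tweepy.API(auth, wait_on_rate_limit=True)
    return [tweet_to_row(s) for s in tweepy.Cursor(api.search_tweets, q=query).items(limit)]

def write_rows(rows, path="raw_tweets.csv"):
    """Write the flattened tweets to the raw-dataset CSV."""
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["screen_name", "text", "created_at"])
        writer.writeheader()
        writer.writerows(rows)
```

`wait_on_rate_limit=True` makes Tweepy sleep through Twitter's rate limits instead of raising, which matters when crawling thousands of accounts.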

After crawling, cleaning, and formatting, about 2 million rows of records remain. They give us information such as the names of the security people and their friend relationships, which we later use to build the network.

The overall size of the raw dataset is over 130 MB, extracted from over 3543 tweets. Those 3543 tweets are just the starting point of our dataset: by extracting the authors of the tweets and retrieving their friends, we obtain the main data of the network. As mentioned before, we also download each person's biography for the later text analysis.
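The friend-expansion step can be sketched as a pure function. Tweepy's `api.get_friends` / `api.get_friend_ids` would supply the friend lists; the mapping below is hypothetical. Only friendships where both ends are in our set of security people become edges:

```python
def friend_edges(friends_by_person):
    """Turn a mapping {person: set of accounts they follow} into
    undirected edges between people inside the mapping."""
    people = set(friends_by_person)
    edges = set()
    for person, friends in friends_by_person.items():
        # keep only friendships whose other end is also a security person
        for friend in set(friends) & people:
            edges.add(tuple(sorted((person, friend))))
    return edges
```

Sorting each pair deduplicates the two directions of a mutual friendship into a single undirected edge.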

Network

The security people themselves are the nodes in our network, and Twitter's friend relation provides the edges.

Our network is therefore a graph showing who is friends with whom.
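A minimal sketch of the construction with networkx (the edge list here is made up; the real one comes from the friend data):

```python
import networkx as nx

# Hypothetical edges: each pair means the two accounts follow each other.
demo_edges = [("@alice", "@bob"), ("@bob", "@carol"),
              ("@alice", "@carol"), ("@dave", "@alice")]

demo_g = nx.Graph()
demo_g.add_edges_from(demo_edges)

# Degree = number of security friends each person has.
print(sorted(demo_g.degree, key=lambda x: x[1], reverse=True))
```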

# collapse-hide
fig = plt.figure(figsize=(20, 10))
nx.draw_networkx_nodes(g, positions, node_size=node_sizes, alpha=0.4)
nx.draw_networkx_edges(g, positions, edge_color="black", alpha=0.05, width=0.5)
plt.title("Security People Network")
plt.axis('off')
fig.show()

Statistics of the network

# collapse-hide
print("Top ten nodes sorted by degree")
sorted(g.degree, key=lambda x: x[1], reverse=True)[:10]
Top ten nodes sorted by degree
[('@HackingDave', 216),
 ('@AlyssaM_InfoSec', 198),
 ('@RayRedacted', 193),
 ('@NicoleBeckwith', 178),
 ('@DfirDiva', 170),
 ('@sherrod_im', 161),
 ('@cybergeekgirl', 161),
 ('@gabsmashh', 160),
 ('@LisaForteUK', 158),
 ('@UK_Daniel_Card', 154)]

# collapse-hide
print('Number of nodes', g.number_of_nodes())
print('Number of edges', g.number_of_edges())
Number of nodes 2050
Number of edges 18040

NLP

We want to explore how many security people there are, find out whether they generally know each other, and see if they can be split into communities. Within these communities, we can use some Natural Language Processing to detect what each community talks about the most and to see whether there is a difference in sentiment.

Communities
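The cells below rely on a `security_communities` dict built earlier in the notebook. The exact partitioning step is not shown in this excerpt; one way such a partition can be obtained is networkx's greedy modularity algorithm (Louvain is another common choice). A sketch on a stand-in graph:

```python
import networkx as nx

# Stand-in graph; in the notebook this would be the friend network `g`.
demo_g = nx.karate_club_graph()

# Modularity-based partition: a list of node sets, largest first.
partition = nx.algorithms.community.greedy_modularity_communities(demo_g)
demo_communities = {i: set(c) for i, c in enumerate(partition)}
print({i: len(c) for i, c in demo_communities.items()})
```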

# collapse-hide
hist, bin_edges = np.histogram(list(len(com) for com in security_communities.values()))
center = ((bin_edges[:-1] + bin_edges[1:]) / 2).round()
fig = plt.figure(figsize=(20, 10))
plt.bar(center, hist)
plt.title("Security community sizes")
plt.ylabel("Count")
plt.xlabel("Community size")
plt.xticks(center)
fig.show()
top_5_largest_communites = sorted(security_communities.values(), key=len, reverse=True)[:5]
with open("bios.csv", newline="") as f:
    csv_reader = csv.DictReader(f)
    bio_by_name = {row["screen_name"]: row["bio"] for row in csv_reader}

bios_by_community = {i: [bio_by_name.get(name, "") for name in members] for i, members in enumerate(top_5_largest_communites)}
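The word-cloud cell below uses a `tfidfs` variable that is not defined in this excerpt. One way it could be computed, shown here as a sketch on a hypothetical input rather than the notebook's exact weighting: term frequency within a community, down-weighted by how many communities use the word, so community-specific jargon stands out.

```python
import collections
import math
import re

def tfidf_per_community(bios_by_community):
    """Return one {word: score} dict per community, TF-IDF style."""
    tokens = {
        i: re.findall(r"[a-z']+", " ".join(bios).lower())
        for i, bios in bios_by_community.items()
    }
    # document frequency: in how many communities does each word appear?
    df = collections.Counter()
    for words in tokens.values():
        df.update(set(words))
    n = len(tokens)
    return [
        {w: c * math.log(n / df[w]) for w, c in collections.Counter(words).items()}
        for words in tokens.values()
    ]

# Hypothetical bios; words shared by every community score zero.
demo_tfidfs = tfidf_per_community({0: ["security researcher"],
                                   1: ["cat lover", "security fan"]})
```

In the notebook, `tfidfs` would then be the result of applying such a function to `bios_by_community`.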

# collapse-hide
wordcloud = WordCloud(
    max_words=100,
    collocations=False,
)

fig, axs = plt.subplots(nrows=len(tfidfs), ncols=1, figsize=(20,20))
for i, tfidf in enumerate(tfidfs):
    wordcloud.generate_from_frequencies(tfidf)
    axs[i].set_title(f"Community {i+1}")
    axs[i].imshow(wordcloud, interpolation="bilinear")
    axs[i].axis("off")


fig.show()

Sentimentality

Are security people sentimental? Let's find out!
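The cell below calls a `bag_of_words` helper that is not shown in this excerpt, and relies on a `words_of_happiness` table (a word-happiness lexicon such as labMT). A minimal sketch of such a tokenizer, producing the lowercase tokens that `compute_average_sentiment` requires:

```python
import re

def bag_of_words(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())
```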

# collapse-hide

def compute_average_sentiment(tokens):
    """Return the average sentiment (happiness) value of the tokens.

    Each token in tokens must be lowercase.
    """
    if not tokens:
        return 0.0

    # mean happiness of the tokens found in the lexicon; nan (no matches) -> 0.0
    return np.nan_to_num(words_of_happiness[words_of_happiness["word"].isin(tokens)]["happiness_average"].mean())

communities = {i: set(members) for i, members in enumerate(top_5_largest_communites)}
text_of_communities = collections.defaultdict(str)

with open("sentiment_tweets.csv", newline="") as f:
    csv_reader = csv.DictReader(f)
    for row in csv_reader:
        for i, members in communities.items():
            if row["screen_name"] in members:
                text_of_communities[i] += f" {row['tweets']}"

sentiment_of_communities = {k: compute_average_sentiment(bag_of_words(v)) for k, v in text_of_communities.items()}
for com, sentiment in sentiment_of_communities.items():
    print(f"Community {com} has a sentiment value of {sentiment}")
Community 3 has a sentiment value of 5.514023823358513
Community 1 has a sentiment value of 5.4363636363636365
Community 4 has a sentiment value of 5.458832378223495
Community 0 has a sentiment value of 5.455909920876445
Community 2 has a sentiment value of 5.5166643454039